
Practical SAX Notes

August 11, 2004

Uche Ogbuji

In this article I discuss issues related to recent articles in this column, including some practical problems using XML facilities -- SAX in particular -- across Python versions and installed software configurations. I also revisit ElementTree's support for XML namespaces and discuss some other Python tools' support for breaking large documents into chunks.

XML Library Headaches

Listings 2, 3, and 4 from my last article, Decomposition, Process, Recomposition, will only work reliably if you have PyXML installed. I mistakenly said the code would work standalone with Python 2.2+. I added a comment with a link to the corrected code in the Python Cookbook.

This is the second time I've made a mistake along these lines in this column. I have several Python installations on my laptop, and I sometimes mix up which batch of libraries I have to hack to disable PyXML. The fault is all mine, but these mistakes are made all too easy by certain problems with the current state of Python's built-in XML libraries as well as PyXML. It's worth some discussion: if I, who helped develop these libraries, can get them confused, no doubt other users can as well.

Putting core SAX and DOM libraries in the Python 2.0 release was an excellent move. XML had established itself as an important technology and the addition of the xml.dom and xml.sax modules should have made it easier to process XML with Python. For a while it has been apparent that the incorporation of XML libraries into the standard library was never as comprehensive as it should have been, and I've begun to think that by cutting corners, we made it actually harder rather than easier to process XML in Python.

The XML modules came from the already established PyXML project, and the corners cut were a matter of leaving out some parts of PyXML in order to reduce the effect that adding the XML libraries had on the size of the Python distribution. You still need to install PyXML in order to gain some capabilities that in hindsight really make the difference in the usefulness of the XML library.

The slightly hobbled, built-in XML library would have been less of an annoyance without the effects of another decision that seemed a good one at the time. PyXML installs as a third-party library (in a hidden module named _xmlplus), but with some clever tricks so that it actually replaces (and masks) the top-level xml module that came with Python.

As an example, I usually disable PyXML for testing purposes by simply renaming the _xmlplus directory to _xmlplus.disable. This automagically reasserts the original XML code that came with Python. Reversing this operation restores PyXML and once again hides the standard XML library. The very fact of this neat trick indicates the sort of troublesome magic that the current arrangement brings about. Things are awkward because there is no straightforward way to determine whether you are using xml.dom and xml.sax from Python's standard library or from PyXML.
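A rough check is still possible by inspecting where the xml package was actually loaded from. The following is a minimal sketch, not part of the original code for this column. It assumes that when PyXML is active, the module object bound to the name xml reports _xmlplus in its module name or file path. The helper names looks_like_pyxml and describe_xml_package are hypothetical.

```python
import xml


def looks_like_pyxml(module_name, module_file):
    """Heuristic (an assumption): PyXML's replacement package is named _xmlplus."""
    return "_xmlplus" in (module_name or "") or "_xmlplus" in (module_file or "")


def describe_xml_package():
    """Report which XML package the interpreter resolved, and from where."""
    name = getattr(xml, "__name__", "")
    path = getattr(xml, "__file__", "") or ""
    source = "PyXML" if looks_like_pyxml(name, path) else "standard library"
    return source, path


if __name__ == "__main__":
    source, path = describe_xml_package()
    print("xml package comes from the %s: %s" % (source, path))
```

A check like this only tells you which package won the import; it says nothing about which individual features that package supports.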

This is a problem that was predicted back when the current arrangement was established, and a debate about it reemerged in early 2003 (cf. "Finding _xmlplus in Python 2.3a2"). I'm not convinced, given all the complexities of the situation, that there was any better solution than the _xmlplus hack. As Jeremy Kloth explains:

Most libraries that get moved into the core are not as large as PyXML and to top it off only part of PyXML was moved into the core. This is probably the root of most of the problems. However, PyXML had dibs on the top-level "xml" before Python core did, so this deal was struck to keep developers for both sides happy. There would have been much code to change if PyXML was forced to change the package to something else, and I believe this is still the case, at least with PyXML.

On the other hand, Thomas Passin put his finger on the crux of the problem with the current arrangement:

Should not one always know which code is being used -- at least, if there is a real possibility of different behavior? If there might be some difference in behavior, one should be able to choose which one to use, or at least detect which one so it can be compensated for.

Not being able to readily make this distinction, for example, is at the root of the trouble I got into with code in my last column. I wasn't sure whether the core Python XML code was being used or the PyXML additions. And there are certainly significant differences in behavior between the two. For a long time the most painful difference was that the C Expat parser library did not come with Python, although Expat was needed for all the built-in XML capabilities. This meant that if you didn't figure out how to compile and link in Expat while building Python, the xml module became essentially useless.

Often the easiest way to fix this was to install PyXML, which does come with all the needed Expat code. All in all, things were very confusing. This situation was improved in Python 2.3, in which the Expat C code was finally included in the Python distribution and built in by default. Some operating system distributions, confused by the former regime, actually interfered with the built-in Expat library in Python, causing lingering problems; but in all cases I am aware of, these problems have now been sorted out.

Another important difference between the standard XML library and PyXML that persists even in Python 2.3 is support for namespace-prefix reporting in SAX. If you try to enable the SAX feature xml.sax.handler.feature_namespace_prefixes in stock Python 2.3, you get a SAXNotSupportedException. This is precisely the difference that bit me in the last article.
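Rather than guessing, code can probe for the feature at run time and adapt. The following is a minimal sketch that uses only the standard xml.sax API. The function name supports_namespace_prefixes is hypothetical, and the sketch only tests whether the parser accepts the feature, not how prefixes are subsequently reported.

```python
import xml.sax
from xml.sax import SAXNotRecognizedException, SAXNotSupportedException
from xml.sax.handler import feature_namespace_prefixes


def supports_namespace_prefixes(parser=None):
    """Return True if the parser accepts the namespace-prefixes feature."""
    if parser is None:
        # Use whatever SAX driver this installation selects by default.
        parser = xml.sax.make_parser()
    try:
        parser.setFeature(feature_namespace_prefixes, True)
    except (SAXNotSupportedException, SAXNotRecognizedException):
        return False
    return True


if __name__ == "__main__":
    if supports_namespace_prefixes():
        print("namespace prefixes will be reported")
    else:
        print("no prefix reporting; a different parser is needed for this")
```

Probing once at startup lets a program fail early with a clear message instead of surfacing the exception deep inside a processing run.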

As a result of all the above, I have a few recommendations. First, for anyone shipping code that uses the built-in XML library, make Python 2.3 a minimum requirement. Doing so will save your users a lot of confusion, but this might be too limiting for some developers. As a fallback, if you ship code using the XML libraries and you wish to support Python versions 2.0 through 2.2.x, make PyXML a prerequisite. Maybe for advanced users you can add a note that they needn't install PyXML if they are sure their base Python distribution has a working pyexpat. If you need to support namespace prefixes in SAX, make PyXML a prerequisite regardless.
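These recommendations can be sketched as a small decision function. The function name and its parameters are hypothetical, and it encodes nothing beyond the rules stated in the paragraph above.

```python
def recommended_prerequisite(python_version,
                             needs_namespace_prefixes=False,
                             has_working_pyexpat=False):
    """Return "PyXML" if it should be a prerequisite, else None.

    python_version is a tuple such as (2, 2) or (2, 3).
    """
    # Namespace prefixes in SAX require PyXML regardless of Python version.
    if needs_namespace_prefixes:
        return "PyXML"
    # From Python 2.3 on, Expat ships with Python and is built in by default.
    if python_version >= (2, 3):
        return None
    # On 2.0 through 2.2.x, PyXML can be skipped only with a working pyexpat.
    if has_working_pyexpat:
        return None
    return "PyXML"


if __name__ == "__main__":
    print(recommended_prerequisite((2, 2)))
    print(recommended_prerequisite((2, 3)))
    print(recommended_prerequisite((2, 3), needs_namespace_prefixes=True))
```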

XMLFilter

If PyXML is too much to make users install, a further option is to require Andrew Shearer's XMLFilter. XMLFilter is a multi-purpose package that aims to make SAX processing in Python easier and saner. To this effect it provides the following features or safety nets (based on descriptions on the XMLFilter home page):

  1. It provides an xml.sax-compatible XML parser even when the installed version of Python lacks a working copy of xml.sax (for example, on Python 2.0 through 2.2.x installations that were built without a working Expat library). It does so by wrapping the older xmllib parser, which does work on all Python 2.x versions.
  2. It provides a SAX-like XML output module, much like xml.sax.saxutils.XMLGenerator, but safer for general 2.x Python code because XMLGenerator has a serious bug in Python 2.2 (which also bit me earlier in this column).
  3. It allows programs to hint that they want to write particular chunks of content to an XML file as CDATA, in a way that works well with other filters in a SAX chain.
  4. It provides a SAX filter framework.

If, as described in (1) above, XMLFilter detects that there are no SAX parsers available and falls back to its built-in xmllib adapter, it does happen to support the xml.sax.handler.feature_namespace_prefixes feature. However, if there is a working SAX parser (e.g. pyexpat was properly compiled in the Python installation or you're using Python 2.3 or later), XMLFilter will fall back to using the built-in SAX module, which does not support the xml.sax.handler.feature_namespace_prefixes feature.

This means that if you need support for namespace prefixes, regardless of Python version, you still really need PyXML and XMLFilter wouldn't quite be enough. I have tried to untangle this spaghetti of decision points in the flowchart in Figure 1.

Figure 1. Navigating the minefield of recommended prerequisites for basic XML processing.

Consider XMLFilter if you want to use SAX filters, especially if you want to emit XML at the end of the filter chain.
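For readers unfamiliar with the pattern, here is what a SAX filter chain that emits XML at its end looks like. This sketch does not use Andrew Shearer's XMLFilter, whose API is not shown in this article. It uses the standard library's xml.sax.saxutils.XMLFilterBase and XMLGenerator instead, and it is written against a current Python 3 standard library rather than the 2.x releases discussed here. The UppercaseText filter and the run_filter helper are hypothetical names chosen for illustration.

```python
import io
import xml.sax
from xml.sax.saxutils import XMLFilterBase, XMLGenerator


class UppercaseText(XMLFilterBase):
    """A toy filter: upper-cases character data and passes all else through."""

    def characters(self, content):
        XMLFilterBase.characters(self, content.upper())


def run_filter(xml_text):
    """Parse xml_text, push events through the filter, and return the output XML."""
    out = io.StringIO()
    parser = xml.sax.make_parser()
    chain = UppercaseText(parser)  # the filter sits downstream of the parser
    chain.setContentHandler(XMLGenerator(out, encoding="utf-8"))
    chain.parse(io.BytesIO(xml_text.encode("utf-8")))
    return out.getvalue()


if __name__ == "__main__":
    print(run_filter("<greeting>hello</greeting>"))
```

Because each filter is itself a content handler and an event source, further filters can be stacked by passing one filter as the parent of the next.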

What to Do About this Mess?

Clearly this state of confusion can't be allowed to continue indefinitely, but what is there to do about it? Some advocate just undoing the _xmlplus hack and moving PyXML fully to a separate module. In other words, just make it plain old xmlplus, without the leading underscore and the hijack of the xml namespace upon install. Martijn Faassen advocated this in the 2003 thread I've already mentioned. But this is probably not a good idea because it throws an even more severe fault line into the mix, and it would require a lot of modification of code in Python, in PyXML, and in third-party libraries and applications.

I think the biggest cause of the mess is the fact that the code from PyXML that made it into Python was minimized due to space concerns. I've already applauded the decision, for Python 2.4, to include CJKCodecs, a unified Unicode codec set for Chinese, Japanese, and Korean encodings. I think this is a great idea even though CJKCodecs are huge and take up more space than all of PyXML. Continuing with my bias toward utility, even if it costs space, I think the most useful bits of PyXML should be moved into Python in their entirety. This could be done by merging in PyXML and leaving out the following:

  • 4DOM (probably obsolete with the availability of pxdom).
  • xmlproc (could be installed separately).
  • Python 1.5.2 compatibility code (the unicode module).
  • The TREX schema implementation (probably obsolete with the emergence of RELAX NG).
  • The xpath and xslt modules (out-of-date versions of the modules from 4Suite).
  • The marshalling modules (I'd argue better third-party alternatives are available).
  • qp_xml (neglected and no longer really useful).

Notice my point about 4DOM. It's worth reiterating as another recommendation: Don't use 4DOM anymore. That is, don't use the code that results from invoking code in the xml.dom.ext.reader module. Minidom is a reasonable default DOM implementation. If you need more speed or less greedy memory usage, try 4Suite's Domlettes. If you want strict DOM conformance, use pxdom. If you're feeling adventurous enough to avoid DOM altogether, try ElementTree, one of the data bindings I've covered, or PyRXPU (but not PyRXP).

Besides space, another factor behind the decision not to move all of PyXML into Python was the fact that PyXML could be updated more frequently than Python as a whole, allowing for quicker bug fixes and feature additions. I think this is no longer much of an issue now that Python has settled into a regular and fairly short release cycle.

The main obstacle to making this happen is the lack of a clear owner who can take charge of the state of all things XML in the Python standard library. Many people have generously donated time to Python XML development, but no obvious candidates present themselves who happen to have the available time or sponsorship to lead and maintain a merger of PyXML into Python. It's probably too late for this to be done in time for Python 2.4, but perhaps Python 2.5 is within reach.

ElementTree, Namespaces and Techniques for Large Documents

The ever entertaining, ever resourceful Fredrik Lundh has blogged some interesting rejoinders to my recent articles. First of all, he commented on my difficulties dealing with namespace prefixes in ElementTree. He pointed out an article on how to deal with the fact that SOAP uses qualified names in content, and how to get ElementTree to work with this complexity.

He is certainly correct that uses of namespaces that require tracking of prefixes violate the spirit of the XML namespaces specifications. In fact, a lot of XML experts have come to castigate the very widespread practice known as "qualified names in content." For my part, although I see a lot of problems exposing prefixes within content, I'm not sure I have seen any superior alternative short of completely redesigning XML namespaces, which seems hardly realistic at this point. At any rate, in my article I did point out that ElementTree was well within its standard-compliance rights in ignoring prefix information. I covered prefixes as much as I could because they remain a persistent nuisance in many XML processing tasks.

The prefix-aware code Lundh originally linked to was too SOAP-specific to really help ElementTree users in the general case, but in a more recent article he comes up with a more useful ElementTree add-on. He corrects the code I used in my original article so that it does handle full namespace scoping properly. With this fix, my utility for ElementTree (the analyze_clark_name function) allows users to deal with XML prefixes.
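To make the underlying idea concrete, here is a minimal sketch of taking apart the "{uri}local" names that ElementTree uses and of recording prefix declarations while parsing. It is neither the analyze_clark_name function nor Lundh's add-on. The helper names are hypothetical, the sketch uses the xml.etree.ElementTree module of a current Python 3 standard library, and it deliberately ignores namespace scoping by keeping only the first prefix seen for each URI.

```python
import io
import xml.etree.ElementTree as ET


def split_clark_name(name):
    """Split "{uri}local" into (uri, local); uri is None if unqualified."""
    if name.startswith("{"):
        uri, local = name[1:].split("}", 1)
        return uri, local
    return None, name


def collect_prefixes(xml_bytes):
    """Map each namespace URI to the first prefix declared for it.

    Simplification: no scoping, so redeclared prefixes are not tracked.
    """
    prefixes = {}
    for _event, (prefix, uri) in ET.iterparse(io.BytesIO(xml_bytes),
                                              events=("start-ns",)):
        prefixes.setdefault(uri, prefix)
    return prefixes


if __name__ == "__main__":
    doc = b'<x:a xmlns:x="http://example.org/ns"><x:b/></x:a>'
    print(collect_prefixes(doc))
    print(split_clark_name(ET.fromstring(doc).tag))
```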

Lundh also pointed out a neat little ElementTree recipe for the same sort of DOM chunking that I presented in my last article. I would like to mention that the advantages of sax2dom_chunker.py include:

  1. No need to install any third-party packages (at least once the corrections I mention in this article are made).
  2. Support for any DOM implementation that meets the Python DOM conventions.
  3. Simple, declarative way to specify the chunk boundaries, rather than having the user write procedural code for this. Full XPattern support would magnify this advantage.
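For comparison, the procedural style of chunking with ElementTree can be sketched as follows. This is not Lundh's recipe and not sax2dom_chunker.py. It assumes the iterparse function from the xml.etree.ElementTree module of a current Python 3 standard library, and the iter_chunks helper is a hypothetical name.

```python
import io
import xml.etree.ElementTree as ET


def iter_chunks(source, tag):
    """Yield each completed element named tag, then release its contents."""
    for _event, elem in ET.iterparse(source, events=("end",)):
        if elem.tag == tag:
            yield elem
            # Drop the element's children and text to limit memory growth.
            elem.clear()


# Usage: handle one <entry> at a time instead of building the whole tree.
doc = b"<log><entry>a</entry><entry>b</entry></log>"
texts = [entry.text for entry in iter_chunks(io.BytesIO(doc), "entry")]
```

Note that the chunk boundary here is chosen by a tag comparison in code, which is the procedural approach that point 3 above contrasts with a declarative pattern.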

Sean McGrath also pointed out some aspects of "sadly neglected" projects of his that provided similar assistance for those dealing with huge documents.

Wrap Up

    

It's really unfortunate that the state of XML modules in the Python standard library is so brittle and inconsistent. I hope that I and others can soon marshal the resources to clean things up in a coming Python version. If you (or anyone you know) are interested in contributing to Python and have a solid understanding of either Python or XML, consider contributing efforts toward PyXML, with the eventual goal of merging it more closely into the Python distribution.

Turning to those who are already active on the Python-XML front, Fred Drake announced the release of Expat 1.95.8, which fixes some minor bugs and adds support for suspend/resume. "Handlers can now request that a parse be suspended for later resumption or aborted altogether." See the announcement.

I made the first public release (0.5.0) of Scimitar, an implementation of ISO Schematron that compiles a Schematron schema into a Python validator script, making it a more efficient and somewhat more flexible approach than the usual XSLT implementations. See the announcement.

lxml, the alternative Python binding for libxml I mentioned in my last article, has moved here. There is also an lxml mailing list. No meaningful postings yet, nor any packaged releases of the code, but this is a project worth watching.

Eric van der Vlist announced his OSCON paper XML Driven Classes, which discusses an alternative XML data-binding he has been working on. Eric also tells me Guido had a lightning talk at OSCON about a Python/XML data-binding of his own, but I've been unable to find any more information on this. These days data bindings are sprouting like May blossoms. Soon we'll be at the point where we can start to consider consolidation, but for now, competition is good.