Practical SAX Notes
In this article I discuss issues related to recent articles in this column, including some practical problems using XML facilities -- SAX in particular -- across Python versions and installed software configurations. I also revisit ElementTree's support for XML namespaces and discuss some other Python tools' support for breaking large documents into chunks.
Listings 2, 3, and 4 from my last article, Decomposition, Process, Recomposition, will only work reliably if you have PyXML installed. I mistakenly said the code would work standalone with Python 2.2+. I added a comment with a link to the corrected code in the Python Cookbook.
This is the second time I've made a mistake along these lines in this column. I have several Python installations on my laptop, and I sometimes mix up which batch of libraries I have to hack to disable PyXML. The fault is all mine, but these mistakes are made all too easy by certain problems with the current state of Python's built-in XML libraries as well as PyXML. It's worth some discussion: if I, who helped develop these libraries, can get them confused, no doubt other users can as well.
Putting core SAX and DOM libraries in the Python 2.0 release was an
excellent move. XML had established itself as an important technology
and the addition of the xml.dom and xml.sax
modules should have made it easier to process XML with Python. For a
while it has been apparent that the incorporation of XML libraries
into the standard library was never as comprehensive as it should have
been, and I've begun to think that by cutting corners, we made it
actually harder rather than easier to process XML in Python.
The XML modules came from the already established PyXML project, and the corners cut were a matter of leaving out some parts of PyXML in order to reduce the effect that adding the XML libraries had on the size of the Python distribution. You still need to install PyXML in order to gain some capabilities that in hindsight really make the difference in the usefulness of the XML library.
The slightly hobbled, built-in XML library would have been less of
an annoyance without the effects of another decision that seemed a
good one at the time. PyXML installs as a third-party library (in a
hidden module named _xmlplus), but with some clever tricks
so that it actually replaces (and masks) the
top-level xml module that came with Python.
As an
example, I usually disable PyXML for testing purposes by simply
renaming the _xmlplus directory
to _xmlplus.disable. This automagically reasserts the
original XML code that came with Python. Reversing this operation
restores PyXML and once again hides the standard XML library. The
very fact of this neat trick indicates the sort of troublesome magic
that the current arrangement brings about. Things are awkward because
there is no straightforward way to determine whether you are
using xml.dom and xml.sax from Python's
standard library or from PyXML.
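One rough way to ask the question at runtime is to look at what the imported xml package says about itself. The following is only a sketch: it assumes that a PyXML-backed package gives itself away through the string _xmlplus in its module name or load path, and the helper name describe_xml_package is my own invention rather than part of either library.

```python
import xml


def describe_xml_package(module=xml):
    """Report where an imported ``xml`` package appears to come from."""
    name = getattr(module, "__name__", "")
    path = getattr(module, "__file__", None) or ""
    # Assumption: PyXML shows up as "_xmlplus" in the module name or in
    # the path the package was loaded from. This is a heuristic only.
    from_pyxml = "_xmlplus" in name or "_xmlplus" in path
    return {"name": name, "path": path, "pyxml": from_pyxml}
```

Calling describe_xml_package() with no arguments inspects whatever xml package the interpreter actually imported, which is exactly the thing that is hard to tell by reading the import statement alone.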
This is a problem that was predicted
back when the current arrangement was established, and a debate about it reemerged in early 2003 (cf. "Finding _xmlplus in Python
2.3a2"). I'm not convinced, given all the complexities of the
situation, that there was any better solution than
the _xmlplus hack. As Jeremy Kloth explains:
Most libraries that get moved into the core are not as large as PyXML and to top it off only part of PyXML was moved into the core. This is probably the root of most of the problems. However, PyXML had dibs on the top-level "xml" before Python core did, so this deal was struck to keep developers for both sides happy. There would have been much code to change if PyXML was forced to change the package to something else, and I believe this is still the case, at least with PyXML.
On the other hand, Thomas Passin put his finger on the crux of the problem with the current arrangement:
Should not one always know which code is being used -- at least, if there is a real possibility of different behavior? If there might be some difference in behavior, one should be able to choose which one to use, or at least detect which one so it can be compensated for.
Not being able to readily make this distinction, for example, is at
the root of the trouble I got into with code in my last column. I
wasn't sure whether the core Python XML code was being used or the
PyXML additions. And there are certainly significant differences in
behavior between the two. For a long time the most painful difference
was that the C Expat parser library did not come with Python, although
Expat was needed for all the built-in XML capabilities. This meant
that if you didn't figure out how to compile and link in Expat while
building Python, the xml module became essentially
useless.
Often the easiest way to fix this was to install PyXML, which does come with all the needed Expat code. All in all, things were very confusing. This situation was improved in Python 2.3, in which the Expat C code was finally included in the Python distribution and built in by default. Some operating system distributions, confused by the former regime, actually interfered with the built-in Expat library in Python, causing lingering problems; but in all cases I am aware of, these problems have now been sorted out.
Another important difference between the standard XML library and
PyXML that persists even in Python 2.3 is support for namespace-prefix
reporting in SAX. If you try to enable the SAX
feature xml.sax.handler.feature_namespace_prefixes in
stock Python 2.3, you get a SAXNotSupportedException.
This is precisely the difference that bit me in the last article.
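Rather than discovering this difference halfway through a parse, code can probe for the feature up front. Here is a minimal sketch using only the standard xml.sax API; the function name supports_namespace_prefixes is hypothetical, and I also catch SAXNotRecognizedException on the assumption that some parsers may not know the feature at all.

```python
import xml.sax
from xml.sax import SAXNotRecognizedException, SAXNotSupportedException
from xml.sax.handler import feature_namespace_prefixes


def supports_namespace_prefixes(parser=None):
    """Return True if the SAX parser accepts namespace-prefix reporting."""
    if parser is None:
        # Whatever parser the installed xml.sax hands out by default.
        parser = xml.sax.make_parser()
    try:
        parser.setFeature(feature_namespace_prefixes, True)
    except (SAXNotSupportedException, SAXNotRecognizedException):
        return False
    return True
```

A program that depends on prefix reporting could call this once at startup and fail with a clear message about the missing prerequisite, instead of with a traceback from deep inside a handler.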
As a result of all the above, I have a few recommendations.
First, for anyone shipping code that uses the built-in XML library,
make Python 2.3 a minimum requirement. Doing so will save your users
a lot of confusion, but this might be too limiting for some
developers. As a fallback, if you ship code using the XML libraries
and you wish to support Python versions 2.0 through 2.2.x, make PyXML
a prerequisite. Maybe for advanced users you can add a note that they
needn't install PyXML if they are sure their base Python distribution
has a working pyexpat. If you need to support namespace prefixes in
SAX, make PyXML a prerequisite regardless.
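These requirements can also be checked by the code itself when it starts up. The sketch below is one way to do that, not a recipe from any of the libraries discussed: the function name xml_prerequisite_problems and the wording of its messages are mine, and it treats "Expat works" as "xml.parsers.expat imports and can create a parser".

```python
import sys


def xml_prerequisite_problems(min_version=(2, 3)):
    """Return a list of human-readable problems; empty means good to go."""
    min_version = tuple(min_version)
    problems = []
    if sys.version_info[:2] < min_version:
        problems.append("Python %d.%d or later is required" % min_version)
    try:
        from xml.parsers import expat
        expat.ParserCreate()  # raises if the Expat bindings are unusable
    except Exception:  # typically ImportError on builds without pyexpat
        problems.append("no working Expat parser; install PyXML")
    return problems
```

An installer or a module's import-time check could print these messages and stop, which is kinder to users than letting them hit an obscure failure later.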
If PyXML is too much to make users install, a further option is to require Andrew Shearer's XMLFilter. XMLFilter is a multi-purpose package that aims to make SAX processing in Python easier and saner. To this effect it provides the following features or safety nets (based on descriptions on the XMLFilter home page):
1. An xml.sax-compatible XML parser, even when the installed version of Python lacks a working copy of xml.sax (for example, Python 2.0 through 2.2.x installations that were built without a working Expat library). It does so by wrapping the older xmllib parser, which does work on all Python 2.x versions.
2. A facility like xml.sax.saxutils.XMLGenerator, but safer for general 2.x Python code because XMLGenerator has a serious bug in Python 2.2 (which also bit me earlier in this column).

If, as described in (1) above, XMLFilter detects that there are no
SAX parsers available and falls back to its built-in xmllib adapter,
it does happen to support the xml.sax.handler.feature_namespace_prefixes feature. However,
if there is a working SAX parser (e.g. pyexpat was properly compiled
in the Python installation or you're using Python 2.3 or later),
XMLFilter will fall back to using the built-in SAX module, which
does not support
the xml.sax.handler.feature_namespace_prefixes feature.
This means that if you need support for namespace prefixes, regardless of Python version, you still really need PyXML and XMLFilter wouldn't quite be enough. I have tried to untangle this spaghetti of decision points in the flowchart in Figure 1.
Figure 1. Navigating the minefield of recommended prerequisites for basic XML processing.
Consider XMLFilter if you want to use SAX filters, especially if you want to emit XML at the end of the filter chain.
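For readers who prefer code to prose, the recommendations above can be condensed into a small decision function. This is my own reading of the text, written as a sketch; the function name, its parameters, and the returned strings are all hypothetical.

```python
def recommended_prerequisite(python_version, has_working_expat,
                             needs_namespace_prefixes, pyxml_acceptable=True):
    """Suggest what to require of users for basic SAX processing.

    python_version is a (major, minor) tuple; the result is one of
    "PyXML", "XMLFilter", or "none".
    """
    # Namespace-prefix reporting needs PyXML regardless of anything else.
    if needs_namespace_prefixes:
        return "PyXML"
    # Python 2.3+ bundles Expat; older builds are fine if pyexpat works.
    if tuple(python_version) >= (2, 3) or has_working_expat:
        return "none"
    # Python 2.0 through 2.2.x without Expat: PyXML, or XMLFilter if
    # PyXML is too much to ask of users.
    return "PyXML" if pyxml_acceptable else "XMLFilter"
```

For example, a Python 2.2 installation without Expat, whose users balk at installing PyXML and do not need prefixes, lands on XMLFilter.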
Clearly this state of confusion can't be allowed to continue
indefinitely, but what is there to do about it? Some advocate just
undoing the _xmlplus hack, and moving PyXML fully to a
separate module. In other words just make it plain
old xmlplus, without the leading underscore and the
hijack of the xml namespace upon install. Martijn Faassen
advocated this in the 2003 thread I've already mentioned. But this is
probably not a good idea because it throws an even more severe
fault line into the mix, and it would require a lot of modification of
code, in Python, in PyXML and in third-party libraries and
applications.
I think the biggest cause of the mess is the fact that the code from PyXML that made it into Python was minimized due to space concerns. I've already applauded the decision, for Python 2.4, to include CJKCodecs, a unified Unicode codec set for Chinese, Japanese, and Korean encodings. I think this is a great idea even though CJKCodecs are huge and take up more space than all of PyXML. Continuing with my bias toward utility, even if it costs space, I think the most useful bits of PyXML should be moved into Python in their entirety. This could be done by merging in PyXML and leaving out the following:
Notice my point about 4DOM. It's worth reiterating as another
recommendation: Don't use 4DOM anymore. That is, don't use the code that results from invoking code in the xml.dom.ext.reader module. Minidom is a reasonable default DOM implementation. If you
need more speed or less greedy memory usage, try 4Suite's Domlettes.
If you want strict DOM conformance, use pxdom. If you're feeling
adventurous enough to avoid DOM altogether, try ElementTree, one of
the data bindings I've covered, or PyRXPU (but not PyRXP).
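As a reminder of how little is needed to use minidom as that default, here is a minimal sketch that pulls the text out of every element with a given name. The helper name element_texts is hypothetical; the minidom calls themselves are standard.

```python
from xml.dom import minidom


def element_texts(xml_text, tag_name):
    """Return the direct text content of each element named tag_name."""
    doc = minidom.parseString(xml_text)
    try:
        texts = []
        for node in doc.getElementsByTagName(tag_name):
            # Join only the element's own text nodes, ignoring children.
            parts = [child.data for child in node.childNodes
                     if child.nodeType == child.TEXT_NODE]
            texts.append("".join(parts))
        return texts
    finally:
        doc.unlink()  # break reference cycles so the tree is freed promptly
```

Nothing here touches xml.dom.ext.reader, so the code behaves the same whether or not PyXML is installed.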
Besides space, another factor behind the decision not to move all of PyXML into Python was the fact that PyXML could be updated more frequently than Python as a whole, allowing for quicker bug fixes and feature additions. I think this is no longer much of an issue now that Python has settled into a regular and fairly short release cycle.
The main obstacle to making this happen is the lack of a clear owner who can take charge of the state of all things XML in the Python standard library. Many people have generously donated time to Python XML development, but no obvious candidates present themselves who happen to have the available time or sponsorship to lead and maintain a merger of PyXML into Python. It's probably too late for this to be done in time for Python 2.4, but perhaps Python 2.5 is within reach.
The ever entertaining, ever resourceful Fredrik Lundh has blogged some interesting rejoinders to my recent articles. First of all, he commented on my difficulties dealing with namespace prefixes in ElementTree. He pointed out an article on how to deal with the fact that SOAP uses qualified names in content, and how to get ElementTree to work with this complexity.
He is certainly correct in that uses of namespaces that require tracking of prefixes violate the spirit of the XML namespaces specifications. In fact, a lot of XML experts have come to castigate the very widespread practice known as "qualified names in content." For my part, although I see a lot of problems exposing prefixes within content, I'm not sure I have seen any superior alternative short of completely redesigning XML namespaces, which seems hardly realistic at this point. At any rate, in my article I did point out that ElementTree was well within its standard-compliance rights in ignoring prefix information. I covered prefixes as much as I could because they remain a persistent nuisance in many XML processing tasks.
The prefix-aware code Lundh originally linked to was too
SOAP-specific to really help ElementTree users in the general case,
but in a more recent article he comes up with a more useful ElementTree add-on.
He corrects the code I used in my original article so that it does handle full namespace scoping properly.
With this fix, my utility for ElementTree (the analyze_clark_name function) allows users to deal
with XML prefixes.
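To make the idea concrete without reproducing either Lundh's add-on or my analyze_clark_name function, here is an independent sketch using the ElementTree API as it ships in today's standard library (xml.etree.ElementTree). ElementTree stores names in Clark notation, {uri}local, and iterparse can report namespace declarations as they come into and go out of scope. The names split_clark_name and names_with_prefixes are hypothetical, and the prefix returned is simply an in-scope prefix bound to the element's namespace, which is not necessarily the one the document author typed.

```python
import io
import xml.etree.ElementTree as ET


def split_clark_name(tag):
    """Split '{uri}local' into (uri, local); uri is None without a namespace."""
    if tag.startswith("{"):
        uri, local = tag[1:].split("}", 1)
        return uri, local
    return None, tag


def names_with_prefixes(xml_bytes):
    """Return (prefix, uri, local) for each element, in document order."""
    declarations = []  # stack of (prefix, uri) pairs currently in scope
    results = []
    events = ("start-ns", "end-ns", "start")
    for event, payload in ET.iterparse(io.BytesIO(xml_bytes), events=events):
        if event == "start-ns":
            declarations.append(payload)  # payload is (prefix, uri)
        elif event == "end-ns":
            declarations.pop()  # innermost declaration leaves scope
        else:
            uri, local = split_clark_name(payload.tag)
            prefix = None
            # Search innermost declarations first. Simplification: a
            # prefix shadowed by an inner redeclaration is not detected.
            for declared_prefix, declared_uri in reversed(declarations):
                if declared_uri == uri:
                    prefix = declared_prefix
                    break
            results.append((prefix, uri, local))
    return results
```

A default namespace shows up with the empty string as its prefix, while elements in no namespace come back with both prefix and URI set to None.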
Lundh also pointed out a neat little ElementTree recipe for the same sort of DOM chunking that I
presented in my last article. I would like to mention that the
advantages of sax2dom_chunker.py include:
Sean McGrath also pointed out some aspects of "sadly neglected" projects of his that provided similar assistance for those dealing with huge documents.
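For readers who have not seen chunking done in ElementTree terms, the general shape looks something like the following. This is a sketch written against the iterparse function in today's xml.etree.ElementTree, not a copy of the recipe Lundh pointed to; the function name iter_chunks is hypothetical, and it assumes that chunk elements do not nest inside one another.

```python
import io
import xml.etree.ElementTree as ET


def iter_chunks(source, chunk_tag):
    """Yield each complete element named chunk_tag, then release it."""
    context = ET.iterparse(source, events=("start", "end"))
    _, root = next(context)  # the first event is the start of the root
    for event, elem in context:
        if event == "end" and elem.tag == chunk_tag:
            yield elem
            # Drop everything built so far so that memory use stays
            # bounded. The caller must be finished with elem by now.
            root.clear()


if __name__ == "__main__":
    data = b"<log><entry id='1'>a</entry><entry id='2'>b</entry></log>"
    for entry in iter_chunks(io.BytesIO(data), "entry"):
        print(entry.get("id"), entry.text)
```

Each chunk arrives as a small, fully built subtree, so the per-chunk code can use the ordinary ElementTree API while the document as a whole is never held in memory.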
It's really unfortunate that the state of XML modules in the Python standard library is so brittle and inconsistent. I hope that I and others can soon marshal the resources to clean things up in a coming Python version. If you (or anyone you know) are interested in contributing to Python and have a solid understanding of either Python or XML, consider contributing efforts toward PyXML, with the eventual goal of merging it more closely into the Python distribution.
Turning to those who are already active on the Python-XML front, Fred Drake announced the release of Expat 1.95.8, which fixes some minor bugs and adds support for suspend/resume. "Handlers can now request that a parse be suspended for later resumption or aborted altogether." See the announcement.
I made the first public release (0.5.0) of Scimitar, an implementation of ISO Schematron that compiles a Schematron schema into a Python validator script, making it a more efficient and somewhat more flexible approach than the usual XSLT implementations. See the announcement.
lxml, the alternative Python binding for libxml I mentioned in my
last article, has moved here. There is also an lxml mailing list. No meaningful postings yet, nor any packaged releases of
the code, but this is a project worth watching.
Eric van der Vlist announced his OSCON paper XML Driven Classes, which discusses an alternative XML data-binding he has been working on. Eric also tells me Guido had a lightning talk at OSCON about a Python/XML data-binding of his own, but I've been unable to find any more information on this. These days data bindings are sprouting like May blossoms. Soon we'll be at the point where we can start to consider consolidation, but for now, competition is good.
XML.com Copyright © 1998-2006 O'Reilly Media, Inc.